Into the Tidyverse

Common tasks in working with data include actions like removing rows or columns, performing calculations, or adding new columns. This sort of operations is known as data manipulation. It is the process of cleaning, organizing, and transforming raw data into a more structured and usable format for analysis.

In this workshop, we’ll guide you through the process of data manipulation in R, starting with the tidyverse. The tidyverse is a collection of packages that align with a data science philosophy developed by Hadley Wickham and the RStudio team. Many users find it to be a more intuitive way to grasp R concepts. Although some tasks may be more straightforward in base-R, we’ll also highlight those aspects.

If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")), let’s proceed to load it into our R session:

library(tidyverse)

The following are key techniques outlined in Hadley Wickham and Garrett Grolemund’s book, R for Data Science:

This workshop focuses on:

This workflow is of utmost importance. Instead of constructing analyses based on the unconventional format of your data, take measures to tidy up your data. Tidy data enables the use of number of analytical and visualization tools. It eliminates the need to develop ad-hoc methods to accommodate your data. This not only saves time but also enhances the clarity and comprehensibility of your work, benefiting both your collaborators and, most importantly, your future self.

Step 1: Importing data

First we need to import data into our R session. This can be achieved either by using datasets bundled with R packages or by importing external data into our workspace for data manipulation. This is the first step in the tidyverse workflow.

Preloaded Data in R Packages

A great way to learn data science tools is using the data provided by R packages. The data() function is a convenient way to explore and import pre-loaded datasets that come bundled with the R environment.

library(tidyverse)
data()
View Output

To import pre-loaded datasets bundled with a particular package (e.g., tidyr):

data(package = "tidyr")
View Output

When you load a package the pre-loaded datasets automatically imported into you R environment. Therefore you can access them directly as follows.

household
Output
# A tibble: 5 × 5
  family dob_child1 dob_child2 name_child1 name_child2
   <int> <date>     <date>     <chr>       <chr>      
1      1 1998-11-26 2000-01-29 Susan       Jose       
2      2 1996-06-22 NA         Mark        <NA>       
3      3 2002-07-11 2004-04-05 Sam         Seth       
4      4 2004-10-10 2009-08-27 Craig       Khai       
5      5 2000-12-05 2005-02-28 Parker      Gracie     
?household
View Output
<!DOCTYPE html> R: Household data
household R Documentation

Household data

Description

This dataset is based on an example in vignette(“datatable-reshape”, package = “data.table”)

Usage

household

Format

A data frame with 5 rows and 5 columns:

family

Family identifier

dob_child1

Date of birth of first child

dob_child2

Date of birth of second child

name_child1

Name of first child

?

name_child2

Name of second child

Reading the Data

At some point, you want to apply what you’ve learned to your own data. In this section, you’ll learn the basics of reading data files into R using the readr package. The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data.

We will use the read_csv() function from readr package to import a dataset. (See also read.csv() in base R.) CSV short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.

The first argument of the read_csv() function takes the path to the file (or a web link). The following code will work if the cms_hospital_patient_satisfaction_2016_sampled.csv file is in the data/patient_satisfaction path.

# here the first argument is a path
cms_data <- read_csv("data/patient_satisfaction/cms_hospital_patient_satisfaction_2016_sampled.csv")
Output
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, Facility Name, County, Hospital Type
dbl (4): Star Rating, No of Surveys, Response Rate, Overall Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# here the first argument is a web link
mtvcars <- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")
Output
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This command prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).

Other file types

Once you’ve mastered read_csv(), using other functions in the readr package is straightforward. It’s just a matter of knowing which function to use:

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker or thousands seperator.
  • read_tsv()reads tab-delimited files.
  • read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.

Exploring the Data

In the previous section we imported a dataset that is bundled with the tidyr package, into a dataframe named household. In the next section we read in a CSV file and created a data frame named cms_data. This section demonstrates different ways to get to know these two data objects.

The class() function is used to identify the data type or data structure of an object (or variable):

class(household)
class(cms_data)
Output
[1] "tbl_df"     "tbl"        "data.frame"
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:

cms_data
Output
# A tibble: 15 × 8
   ID     `Facility Name`   County `Hospital Type` `Star Rating` `No of Surveys`
   <chr>  <chr>             <chr>  <chr>                   <dbl>           <dbl>
 1 050424 SCRIPPS GREEN HO… SAN D… Acute Care Hos…             4            3110
 2 140103 ST BERNARD HOSPI… COOK   Acute Care Hos…             1             264
 3 100051 SOUTH LAKE HOSPI… LAKE   Acute Care Hos…             2            1382
 4 040062 MERCY HOSPITAL F… SEBAS… Acute Care Hos…             3            2506
 5 440048 BAPTIST MEMORIAL… SHELBY Acute Care Hos…             2            1799
 6 450011 ST JOSEPH REGION… BRAZOS Acute Care Hos…             3            1379
 7 151317 GREENE COUNTY GE… GREENE Critical Acces…             3             114
 8 061327 SOUTHWEST MEMORI… MONTE… Critical Acces…             4             247
 9 490057 SENTARA GENERAL … VIRGI… Acute Care Hos…             4             619
10 110215 PIEDMONT FAYETTE… FAYET… Acute Care Hos…             2            1714
11 050704 MISSION COMMUNIT… LOS A… Acute Care Hos…             3             241
12 100296 DOCTORS HOSPITAL  MIAMI… Acute Care Hos…             4             393
13 440003 SUMNER REGIONAL … SUMNER Acute Care Hos…             4             680
14 501339 WHIDBEY GENERAL … ISLAND Critical Acces…             3             389
15 050116 NORTHRIDGE MEDIC… LOS A… Acute Care Hos…             3            1110
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>

The dim() function prints the dimensions (rows x columns) of the data frame:

dim(cms_data)
Output
[1] 15  8

This information is available at the environment pane in the top right panel as the number of observations (rows) and variables (columns).

The nrow() function prints the number of rows while ncol() prints the number of columns:

nrow(cms_data)
ncol(cms_data)
Output
[1] 15
[1] 8

The View() function gives a spreadsheet-like view of the data frame:

View(cms_data)

By clicking the object on the environment tab also gives a spreadsheet-like view of the object:

The glimpse()function (dplyr package) displays a compact summary of the data frame, showing you key details such as the data types of each column, the first few values, and the total number of observations.

glimpse(cms_data)
Output
Rows: 15
Columns: 8
$ ID               <chr> "050424", "140103", "100051", "040062", "440048", "45…
$ `Facility Name`  <chr> "SCRIPPS GREEN HOSPITAL", "ST BERNARD HOSPITAL", "SOU…
$ County           <chr> "SAN DIEGO", "COOK", "LAKE", "SEBASTIAN", "SHELBY", "…
$ `Hospital Type`  <chr> "Acute Care Hospital", "Acute Care Hospital", "Acute …
$ `Star Rating`    <dbl> 4, 1, 2, 3, 2, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3
$ `No of Surveys`  <dbl> 3110, 264, 1382, 2506, 1799, 1379, 114, 247, 619, 171…
$ `Response Rate`  <dbl> 41, 6, 20, 35, 18, 24, 22, 34, 32, 21, 14, 24, 35, 29…
$ `Overall Rating` <dbl> 5, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2

The head() function prints the top 6 rows of a data frame:

head(cms_data)
Output
# A tibble: 6 × 8
  ID     `Facility Name`    County `Hospital Type` `Star Rating` `No of Surveys`
  <chr>  <chr>              <chr>  <chr>                   <dbl>           <dbl>
1 050424 SCRIPPS GREEN HOS… SAN D… Acute Care Hos…             4            3110
2 140103 ST BERNARD HOSPIT… COOK   Acute Care Hos…             1             264
3 100051 SOUTH LAKE HOSPIT… LAKE   Acute Care Hos…             2            1382
4 040062 MERCY HOSPITAL FO… SEBAS… Acute Care Hos…             3            2506
5 440048 BAPTIST MEMORIAL … SHELBY Acute Care Hos…             2            1799
6 450011 ST JOSEPH REGIONA… BRAZOS Acute Care Hos…             3            1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>

Similarly, the tail() function prints the bottom 6 rows of the data frame:

tail(cms_data)
Output
# A tibble: 6 × 8
  ID     `Facility Name`    County `Hospital Type` `Star Rating` `No of Surveys`
  <chr>  <chr>              <chr>  <chr>                   <dbl>           <dbl>
1 110215 PIEDMONT FAYETTE … FAYET… Acute Care Hos…             2            1714
2 050704 MISSION COMMUNITY… LOS A… Acute Care Hos…             3             241
3 100296 DOCTORS HOSPITAL   MIAMI… Acute Care Hos…             4             393
4 440003 SUMNER REGIONAL M… SUMNER Acute Care Hos…             4             680
5 501339 WHIDBEY GENERAL H… ISLAND Critical Acces…             3             389
6 050116 NORTHRIDGE MEDICA… LOS A… Acute Care Hos…             3            1110
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>

The colnames() function displays all the column names:

colnames(cms_data)
[1] "ID"             "Facility Name"  "County"         "Hospital Type" 
[5] "Star Rating"    "No of Surveys"  "Response Rate"  "Overall Rating"

The $ symbol allows access to individual columns. To display ‘Hospital Type’ column:

cms_data$Hospital Type
Error: <text>:1:19: unexpected symbol
1: cms_data$Hospital Type
                      ^

Since the column names contain spaces, they need to be enclosed within `` (backticks) for R to interpret it as a single variable:

cms_data$`Hospital Type` # or cms_data$"Hospital Type"

It is a good practice to rename all the columns with spaces into a format that R can interpret. Conventionally, _ (underscore) is used to separate words in column names and variables instead of (space). We can rename a single column using the rename() function:

cms_data <- rename(cms_data, Hospital_Type = "Hospital Type")
head(cms_data)
Output
# A tibble: 6 × 8
  ID     `Facility Name`      County Hospital_Type `Star Rating` `No of Surveys`
  <chr>  <chr>                <chr>  <chr>                 <dbl>           <dbl>
1 050424 SCRIPPS GREEN HOSPI… SAN D… Acute Care H…             4            3110
2 140103 ST BERNARD HOSPITAL  COOK   Acute Care H…             1             264
3 100051 SOUTH LAKE HOSPITAL  LAKE   Acute Care H…             2            1382
4 040062 MERCY HOSPITAL FORT… SEBAS… Acute Care H…             3            2506
5 440048 BAPTIST MEMORIAL HO… SHELBY Acute Care H…             2            1799
6 450011 ST JOSEPH REGIONAL … BRAZOS Acute Care H…             3            1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>

Note: To modify the original cms_data object, it is essential to assign the renamed object to cms_data. Otherwise, a copy of the cms_data object is changed.

Or by directly assigning a new column name:

colnames(cms_data)[2] <- "Facility_Name"
head(cms_data)
Output
# A tibble: 6 × 8
  ID     Facility_Name        County Hospital_Type `Star Rating` `No of Surveys`
  <chr>  <chr>                <chr>  <chr>                 <dbl>           <dbl>
1 050424 SCRIPPS GREEN HOSPI… SAN D… Acute Care H…             4            3110
2 140103 ST BERNARD HOSPITAL  COOK   Acute Care H…             1             264
3 100051 SOUTH LAKE HOSPITAL  LAKE   Acute Care H…             2            1382
4 040062 MERCY HOSPITAL FORT… SEBAS… Acute Care H…             3            2506
5 440048 BAPTIST MEMORIAL HO… SHELBY Acute Care H…             2            1799
6 450011 ST JOSEPH REGIONAL … BRAZOS Acute Care H…             3            1379
# ℹ 2 more variables: `Response Rate` <dbl>, `Overall Rating` <dbl>

Using the above methods to rename multiple columns just to replace space with underscore can be laborious. A quick way to replace all spaces in the column names with underscore is shown below.

Option 1:

# by string substituting _ in place of ' '(space)
colnames(cms_data) <- gsub(" ", "_", colnames(cms_data))
colnames(cms_data)
Output
[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star_Rating"    "No_of_Surveys"  "Response_Rate"  "Overall_Rating"

Here the gsub() function replace space with underscores in the column names of the cms_data. This new object needs to be assigned (<-) to colnames(cms_data) in order for the changes to be saved in the cms_data object.

Option 2:

Here we use the janitor package which contains functions to clean data. You first need to install and load the library before using it.

# using janitor package
library(janitor) # remember to install janitor: install.packages("janitor")
cms_data <- cms_data |> clean_names()
cms_data
Output
# A tibble: 15 × 8
   id     facility_name           county hospital_type star_rating no_of_surveys
   <chr>  <chr>                   <chr>  <chr>               <dbl>         <dbl>
 1 050424 SCRIPPS GREEN HOSPITAL  SAN D… Acute Care H…           4          3110
 2 140103 ST BERNARD HOSPITAL     COOK   Acute Care H…           1           264
 3 100051 SOUTH LAKE HOSPITAL     LAKE   Acute Care H…           2          1382
 4 040062 MERCY HOSPITAL FORT SM… SEBAS… Acute Care H…           3          2506
 5 440048 BAPTIST MEMORIAL HOSPI… SHELBY Acute Care H…           2          1799
 6 450011 ST JOSEPH REGIONAL HEA… BRAZOS Acute Care H…           3          1379
 7 151317 GREENE COUNTY GENERAL … GREENE Critical Acc…           3           114
 8 061327 SOUTHWEST MEMORIAL HOS… MONTE… Critical Acc…           4           247
 9 490057 SENTARA GENERAL HOSPIT… VIRGI… Acute Care H…           4           619
10 110215 PIEDMONT FAYETTE HOSPI… FAYET… Acute Care H…           2          1714
11 050704 MISSION COMMUNITY HOSP… LOS A… Acute Care H…           3           241
12 100296 DOCTORS HOSPITAL        MIAMI… Acute Care H…           4           393
13 440003 SUMNER REGIONAL MEDICA… SUMNER Acute Care H…           4           680
14 501339 WHIDBEY GENERAL HOSPIT… ISLAND Critical Acc…           3           389
15 050116 NORTHRIDGE MEDICAL CEN… LOS A… Acute Care H…           3          1110
# ℹ 2 more variables: response_rate <dbl>, overall_rating <dbl>

Here we are sending the cms_data data frame into the function clean_names() which replaces replaces spaces with underscore in column names. Note that it also convert all characters in column names to lower case.

The default format is snake case styled as snake_case. You can specify other formats like camel case, title case.

Option 3:

Additionally, we can use the make.names() function in base R which replaces ’ ’ (spaces) with . (dots).

# using make.names() function
colnames(cms_data) <-  make.names(colnames(cms_data))
colnames(cms_data) 
Output
[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star.Rating"    "No.of.Surveys"  "Response.Rate"  "Overall.Rating"

The str() function shows the structure of the data:

str(cms_data)
Output
spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID            : chr [1:15] "050424" "140103" "100051" "040062" ...
 $ Facility_Name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
 $ County        : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
 $ Hospital_Type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
 $ Star Rating   : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
 $ No of Surveys : num [1:15] 3110 264 1382 2506 1799 ...
 $ Response Rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
 $ Overall Rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_character(),
  ..   `Facility Name` = col_character(),
  ..   County = col_character(),
  ..   `Hospital Type` = col_character(),
  ..   `Star Rating` = col_double(),
  ..   `No of Surveys` = col_double(),
  ..   `Response Rate` = col_double(),
  ..   `Overall Rating` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

The summary() function generates summary statistics:

summary(cms_data)
Output
      ID            Facility_Name         County          Hospital_Type     
 Length:15          Length:15          Length:15          Length:15         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Star Rating  No of Surveys    Response Rate Overall Rating 
 Min.   :1.0   Min.   : 114.0   Min.   : 6    Min.   :2.000  
 1st Qu.:2.5   1st Qu.: 326.5   1st Qu.:20    1st Qu.:2.000  
 Median :3.0   Median : 680.0   Median :24    Median :3.000  
 Mean   :3.0   Mean   :1063.1   Mean   :25    Mean   :2.733  
 3rd Qu.:4.0   3rd Qu.:1548.0   3rd Qu.:33    3rd Qu.:3.000  
 Max.   :4.0   Max.   :3110.0   Max.   :41    Max.   :5.000  

A statitical overview can be obtained using the skim() function in skimr package:

library(skimr)
skim(cms_data)
Output
Data summary
Name cms_data
Number of rows 15
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 6 6 0 15 0
Facility_Name 0 1 16 32 0 15 0
County 0 1 4 14 0 14 0
Hospital_Type 0 1 19 24 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Star Rating 0 1 3.00 0.93 1 2.5 3 4 4 ▁▃▁▇▇
No of Surveys 0 1 1063.13 909.05 114 326.5 680 1548 3110 ▇▁▃▁▁
Response Rate 0 1 25.00 9.30 6 20.0 24 33 41 ▂▇▇▆▆
Overall Rating 0 1 2.73 0.80 2 2.0 3 3 5 ▆▇▁▁▁

Writing Data to a File

Writing data to a file is a fundamental operation in programming and data analysis. It involves taking data from within a program or environment and storing it in a file on a disk for later use or sharing. This section explains the basics of writing a data file using the readr package.

The write_csv() and write_tsv() functions are part of the readr package, which is designed for writing delimited files like CSV (comma-separated values) and TSV (tab-separated values). These functions are used to write data frames into CSV and TSV files, respectively.

We first provide the variable name of the data frame followed by the file name (ideally including the full folder location).

To write a CSV file:

# on Mac:
write_csv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_csv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

To write a TSV file:

# on Mac:
write_tsv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_tsv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

Step 2: Tidy Data

Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data, popularized by Hadley Wickham, are designed to promote consistency and ease of use in data analysis.

This is the second step in the tidyverse workflow.

Let’s take a look at some examples.

Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.

For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.

AirPassengers
Output
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?

Data needs to be reshaped to conform to the tidy data structure. It involves using two primary verbs (or pairs of opposites):

  • Convert columns into rows (pivot_longer()).
  • Convert rows into columns (pivot_wider()).
  • Convert a character column into multiple columns (separate_wider_delim() and separate_wider_position()).
  • Combine multiple character columns into a single column (unite()).

First, load the tidyr package. Since you have already installed the tidyverse, you should be able to load it directly as follows (otherwise install it using the command install.packages("tidyverse") if necessary):

library(tidyverse)

Converting data from wide to long format

First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).

counts <- read_csv("data/GSE60450_normalized_data.csv")
head(counts)
Output
# A tibble: 6 × 14
  X           gene_symbol GSM1480291 GSM1480292 GSM1480293 GSM1480294 GSM1480295
  <chr>       <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSMUSG000… Jak3           82.2       81.2          36.1    36.6       12.4   
2 ENSMUSG000… Usp36          88.2       94.6          76.1    63.7       27.1   
3 ENSMUSG000… Mir135a-2       0          0             0       0          0     
4 ENSMUSG000… Strn4          94.3       85.1          59.5    54.3       24.4   
5 ENSMUSG000… Mir6921         0.0895     0.0840        0       0.0846     0.0388
6 ENSMUSG000… Rubcn          37.8       37.7          21.3    27.7       30.1   
# ℹ 7 more variables: GSM1480296 <dbl>, GSM1480297 <dbl>, GSM1480298 <dbl>,
#   GSM1480299 <dbl>, GSM1480300 <dbl>, GSM1480301 <dbl>, GSM1480302 <dbl>

To transform this table from a wide format to a long format, we use the pivot_longer() function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from pivot_longer() is often necessary for ggplot, where each aesthetic or facet category must be a single column of values and for left_join(), which will be introduced later.

This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.

The pivot_longer() function takes three arguments:

  1. cols = : a vector indicating the names of the columns to be converted into labels in long form.
  2. names_to = : a name or vector of names for the new column(s) containing the labels from the specified columns.
  3. **values_to =* *: a name for the new column containing the values corresponding to the specified columns.

It’s important to note that when using pivot_wider(), the new column names need to be enclosed in quotes.

seqdata <- counts |> 
  pivot_longer(cols = starts_with("GSM"), 
               names_to = "Sample", 
               values_to = "Count")

The cols = starts_with("GSM") command returns a vector of columns whose names starts with “GSM”. pivot_longer() will then transform the those columns into two new columns, denoted as “Sample” and “Count.” The parameter names_to = "Sample" indicates that the new column containing the specified columns (defined by cols) should be named “Sample,” while values_to = "Count" specifies that the new column containing the values should be named “Count.”

seqdata
Output
# A tibble: 240 × 4
   X                  gene_symbol Sample     Count
   <chr>              <chr>       <chr>      <dbl>
 1 ENSMUSG00000031805 Jak3        GSM1480291 82.2 
 2 ENSMUSG00000031805 Jak3        GSM1480292 81.2 
 3 ENSMUSG00000031805 Jak3        GSM1480293 36.1 
 4 ENSMUSG00000031805 Jak3        GSM1480294 36.6 
 5 ENSMUSG00000031805 Jak3        GSM1480295 12.4 
 6 ENSMUSG00000031805 Jak3        GSM1480296 11.9 
 7 ENSMUSG00000031805 Jak3        GSM1480297 10.6 
 8 ENSMUSG00000031805 Jak3        GSM1480298 14.9 
 9 ENSMUSG00000031805 Jak3        GSM1480299  7.57
10 ENSMUSG00000031805 Jak3        GSM1480300  7.06
# ℹ 230 more rows

Alternatively, we could achieve the same outcome by specifying a column range using the following command:

seqdata <- counts |> 
  pivot_longer(cols = GSM1480291:GSM1480302, 
               names_to = "Sample", 
               values_to = "Count")

We can also specify the columns we don’t want to reformat, and pivot_longer() will then reformat all the columns except those. To achieve this, we place a minus sign (“-”) in front of the column names that we wish to exclude. This is a commonly used approach with pivot_longer(), as it can be more convenient to exclude columns we don’t need rather than explicitly include the ones we want.

seqdata <- counts |> 
  pivot_longer(cols = -c(X, gene_symbol), 
               names_to = "Sample", 
               values_to = "Count")

Converting data from long to wide format

First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).

annot <- read_csv("data/GSE60450_annotation.csv")
head(annot)
Output
# A tibble: 6 × 3
  ENSEMBL            Type     Annotation                     
  <chr>              <chr>    <chr>                          
1 ENSMUSG00000031805 SYMBOL   Jak3                           
2 ENSMUSG00000031805 GENENAME Janus kinase 3                 
3 ENSMUSG00000033909 SYMBOL   Usp36                          
4 ENSMUSG00000033909 GENENAME ubiquitin specific peptidase 36
5 ENSMUSG00000065524 SYMBOL   Mir135a-2                      
6 ENSMUSG00000065524 GENENAME microRNA 135a-2                

To transform this table so that it conforms to the tidy principles, we use the pivot_wider() function.

This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.

The pivot_wider() function takes two arguments:

  1. names_from = : a name or a vector of names of column(s) containing the labels that will be transformed into the new column names.
  2. values_from = : a name or a vector of names of column(s) containing the values that will fill the new columns.

In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:

annot_tidy <- annot |> 
  pivot_wider(names_from = Type, 
              values_from = Annotation)

The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot_wider() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.

annot_tidy
Output
# A tibble: 20 × 3
   ENSEMBL            SYMBOL        GENENAME                                    
   <chr>              <chr>         <chr>                                       
 1 ENSMUSG00000031805 Jak3          Janus kinase 3                              
 2 ENSMUSG00000033909 Usp36         ubiquitin specific peptidase 36             
 3 ENSMUSG00000065524 Mir135a-2     microRNA 135a-2                             
 4 ENSMUSG00000030374 Strn4         striatin, calmodulin binding protein 4      
 5 ENSMUSG00000098547 Mir6921       microRNA 6921                               
 6 ENSMUSG00000035629 Rubcn         RUN domain and cysteine-rich domain contain…
 7 ENSMUSG00000094053 Scgb2b7       secretoglobin, family 2B, member 7          
 8 ENSMUSG00000055491 Pprc1         peroxisome proliferative activated receptor…
 9 ENSMUSG00000053080 Zfta          zinc finger translocation associated        
10 ENSMUSG00000039715 Dync2i2       dynein 2 intermediate chain 2               
11 ENSMUSG00000033475 Tomm6         translocase of outer mitochondrial membrane…
12 ENSMUSG00000026283 Ing5          inhibitor of growth family, member 5        
13 ENSMUSG00000037331 Larp1         La ribonucleoprotein 1, translational regul…
14 ENSMUSG00000074489 Bglap3        bone gamma-carboxyglutamate protein 3       
15 ENSMUSG00000038246 Fam50b        family with sequence similarity 50, member B
16 ENSMUSG00000066189 Cacng3        calcium channel, voltage-dependent, gamma s…
17 ENSMUSG00000005611 Irag1         inositol 1,4,5-triphosphate receptor associ…
18 ENSMUSG00000064299 4921528I07Rik RIKEN cDNA 4921528I07 gene                  
19 ENSMUSG00000028174 Rpe65         retinal pigment epithelium 65               
20 ENSMUSG00000024902 Mrpl11        mitochondrial ribosomal protein L11         

It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the names_from column, the output will indeed become wider compared to the input.

Separating Columns

First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).

metadata <- read_csv("data/GSE60450_metadata.csv")
head(metadata)
Output
# A tibble: 6 × 2
  gene_id    characteristics                               
  <chr>      <chr>                                         
1 GSM1480291 mammary gland;luminal cells;virgin            
2 GSM1480292 mammary gland;luminal cells;virgin            
3 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
5 GSM1480295 mammary gland;luminal cells;2 day lactation   
6 GSM1480296 mammary gland;luminal cells;2 day lactation   

To transform this table so that it conforms to the tidy principles, we use the separate_wider_position()/separate_wider_delim() function. This operation will separate characteristic column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.

The separate_wider_delim() function takes three arguments:

  1. cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
  2. delim = : delimeter (or separator) between values. This is same as the delim = in read_delim().
  3. names = : a vector containing column names for the the new columns.

To separate characteristic column in the metadata data frame into three separate columns based on the delimeter ; (semi colon), we can use the separate_wider_delim() function:

metadata_lform <- metadata |> 
  separate_wider_delim(cols = characteristics, 
                       delim =";",
                       names = c("tissue_type", "immunophenotype", "development_stage"))
metadata_lform
Output
# A tibble: 12 × 4
   gene_id    tissue_type   immunophenotype development_stage 
   <chr>      <chr>         <chr>           <chr>             
 1 GSM1480291 mammary gland luminal cells   virgin            
 2 GSM1480292 mammary gland luminal cells   virgin            
 3 GSM1480293 mammary gland luminal cells   18.5 day pregnancy
 4 GSM1480294 mammary gland luminal cells   18.5 day pregnancy
 5 GSM1480295 mammary gland luminal cells   2 day lactation   
 6 GSM1480296 mammary gland luminal cells   2 day lactation   
 7 GSM1480297 mammary gland basal cells     virgin            
 8 GSM1480298 mammary gland basal cells     virgin            
 9 GSM1480299 mammary gland basal cells     18.5 day pregnancy
10 GSM1480300 mammary gland basal cells     18.5 day pregnancy
11 GSM1480301 mammary gland basal cells     2 day lactation   
12 GSM1480302 mammary gland basal cells     2 day lactation   

The separate_wider_position() function splits at fixed widths and takes two arguments:

  1. cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
  2. widths = : a named vector containing numbers where the names become the new column names and values specify the column widths.

For instance, we can divide the gene_id column into three separate columns to evaluate the functionality of this operation (this is provided purely as an example):

metadata_lform |> 
  separate_wider_position(cols = gene_id, 
                          widths = c(code = 3, prefix = 4, id = 3))
Output
# A tibble: 12 × 6
   code  prefix id    tissue_type   immunophenotype development_stage 
   <chr> <chr>  <chr> <chr>         <chr>           <chr>             
 1 GSM   1480   291   mammary gland luminal cells   virgin            
 2 GSM   1480   292   mammary gland luminal cells   virgin            
 3 GSM   1480   293   mammary gland luminal cells   18.5 day pregnancy
 4 GSM   1480   294   mammary gland luminal cells   18.5 day pregnancy
 5 GSM   1480   295   mammary gland luminal cells   2 day lactation   
 6 GSM   1480   296   mammary gland luminal cells   2 day lactation   
 7 GSM   1480   297   mammary gland basal cells     virgin            
 8 GSM   1480   298   mammary gland basal cells     virgin            
 9 GSM   1480   299   mammary gland basal cells     18.5 day pregnancy
10 GSM   1480   300   mammary gland basal cells     18.5 day pregnancy
11 GSM   1480   301   mammary gland basal cells     2 day lactation   
12 GSM   1480   302   mammary gland basal cells     2 day lactation   

Uniting Columns

The unite() function is the complement of separate(). Therefore, let’s revert what we did in the previous section to combine multiple columns to a single column as illustrated in the image below.

The unite() function takes three arguments:

  1. col = : name of the new column that will contain the united values.
  2. … = : a vector containing column names to unite.
  3. sep = : delimeter (or separator) this is same as the delim = in read_delim(). If we don’t specify a separator to insert between the combined values, they will be separated by _ (underscores).

To separate characteristic column in the metadata data frame into three separate columns:

metadata_lform |> 
  unite(col = characteristics, 
        tissue_type, immunophenotype, development_stage,
        sep = ",")
Output
# A tibble: 12 × 2
   gene_id    characteristics                               
   <chr>      <chr>                                         
 1 GSM1480291 mammary gland,luminal cells,virgin            
 2 GSM1480292 mammary gland,luminal cells,virgin            
 3 GSM1480293 mammary gland,luminal cells,18.5 day pregnancy
 4 GSM1480294 mammary gland,luminal cells,18.5 day pregnancy
 5 GSM1480295 mammary gland,luminal cells,2 day lactation   
 6 GSM1480296 mammary gland,luminal cells,2 day lactation   
 7 GSM1480297 mammary gland,basal cells,virgin              
 8 GSM1480298 mammary gland,basal cells,virgin              
 9 GSM1480299 mammary gland,basal cells,18.5 day pregnancy  
10 GSM1480300 mammary gland,basal cells,18.5 day pregnancy  
11 GSM1480301 mammary gland,basal cells,2 day lactation     
12 GSM1480302 mammary gland,basal cells,2 day lactation     

Missing Values

A value can be missing in one of two possible ways:

  • Explicitly, meaning it is flagged with NA.
  • Implicitly, implying that it is just not present in the data.

Let’s illustrate this idea with a very simple data frame:

covid_vac <- data.frame(
    year = c(2020, 2020, 2021, 2021, 2021,  2023, 2023,
             2023, 2024, 2024), 
    vaccine_type = c("Pfizer", "Moderna", "Pfizer", "Moderna", "Novavax", 
                   "Pfizer", "Moderna", 
                  "Novavax", "Moderna", NA),
    count = c(0, 3, 63, 88, 51,
               38, 19,
              5, 9, 7)
)
covid_vac
Output
   year vaccine_type count
1  2020       Pfizer     0
2  2020      Moderna     3
3  2021       Pfizer    63
4  2021      Moderna    88
5  2021      Novavax    51
6  2023       Pfizer    38
7  2023      Moderna    19
8  2023      Novavax     5
9  2024      Moderna     9
10 2024         <NA>     7

In this dataset, we identify two occurrences of missing values:

  1. The vaccine_type in 2024 with a count of 7 is explicitly missing, denoted by the presence of NA in the cell where its value should be.
  2. The counts for the Novavax vaccine in 2020 and Pfizer, Novavax vaccines in 2024, are implicitly missing, as they do not appear in the dataset at all.

is.na()

To identify missing values we can use is.na() function which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.

is.na(covid_vac)
Output
       year vaccine_type count
 [1,] FALSE        FALSE FALSE
 [2,] FALSE        FALSE FALSE
 [3,] FALSE        FALSE FALSE
 [4,] FALSE        FALSE FALSE
 [5,] FALSE        FALSE FALSE
 [6,] FALSE        FALSE FALSE
 [7,] FALSE        FALSE FALSE
 [8,] FALSE        FALSE FALSE
 [9,] FALSE        FALSE FALSE
[10,] FALSE         TRUE FALSE
is.na(covid_vac$vaccine_type)
Output
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

To identify the location or the number of NAs we can use the which() and sum() functions:

which(is.na(covid_vac))
sum(is.na(covid_vac))
Output
[1] 20
[1] 1

na.omit()

To omit all rows containing missing values, we can use na.omit() function in base R:

na.omit(covid_vac)
Output
  year vaccine_type count
1 2020       Pfizer     0
2 2020      Moderna     3
3 2021       Pfizer    63
4 2021      Moderna    88
5 2021      Novavax    51
6 2023       Pfizer    38
7 2023      Moderna    19
8 2023      Novavax     5
9 2024      Moderna     9

complete()

We can use the complete() function to make our dataset more complete or to make missing values explicit in tidy data:

covid_vac |> complete(year, vaccine_type)
Output
# A tibble: 16 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax         NA
 3  2020 Pfizer           0
 4  2020 <NA>            NA
 5  2021 Moderna         88
 6  2021 Novavax         51
 7  2021 Pfizer          63
 8  2021 <NA>            NA
 9  2023 Moderna         19
10  2023 Novavax          5
11  2023 Pfizer          38
12  2023 <NA>            NA
13  2024 Moderna          9
14  2024 Novavax         NA
15  2024 Pfizer          NA
16  2024 <NA>             7

This function add missing values for potential combinations of year and vaccine_type. One problem is that R assumes NA in status as one of the combinations. To fix this, we can specify the labels of status to be considered as follows:

covid_vac |> complete(year, vaccine_type = c("Pfizer", "Moderna", "Novavax"))
Output
# A tibble: 13 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax         NA
 3  2020 Pfizer           0
 4  2021 Moderna         88
 5  2021 Novavax         51
 6  2021 Pfizer          63
 7  2023 Moderna         19
 8  2023 Novavax          5
 9  2023 Pfizer          38
10  2024 Moderna          9
11  2024 Novavax         NA
12  2024 Pfizer          NA
13  2024 <NA>             7

We can use the fill argument to assign the fill value:

covid_vac |> complete(year, 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))
Output
# A tibble: 13 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax          0
 3  2020 Pfizer           0
 4  2021 Moderna         88
 5  2021 Novavax         51
 6  2021 Pfizer          63
 7  2023 Moderna         19
 8  2023 Novavax          5
 9  2023 Pfizer          38
10  2024 Moderna          9
11  2024 Novavax          0
12  2024 Pfizer           0
13  2024 <NA>             7

We can use the full_seq() function from tidyr to fill out the data frame with all years from 2020 to 2024 and assign vaccination types and count values of 0 to those years and for which there was no observation.

covid_vac |> complete(year = full_seq(year, period = 1), 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))
Output
# A tibble: 16 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax          0
 3  2020 Pfizer           0
 4  2021 Moderna         88
 5  2021 Novavax         51
 6  2021 Pfizer          63
 7  2022 Moderna          0
 8  2022 Novavax          0
 9  2022 Pfizer           0
10  2023 Moderna         19
11  2023 Novavax          5
12  2023 Pfizer          38
13  2024 Moderna          9
14  2024 Novavax          0
15  2024 Pfizer           0
16  2024 <NA>             7

fill()

The fill() function is used to fill missing values in a data frame, particularly within columns.

Let’s first make missing values in the covid_vac dataset explicit and assign it to a data frame named covid_vac_comp.

covid_vac_comp <-  covid_vac |> 
  complete(year = full_seq(year, period = 1), 
           vaccine_type = c("Pfizer", "Moderna", "Novavax"))

We can specify the direction to fill the missing values using the argument .direction. Remember to specify the list of columns to fill.

covid_vac_comp |> fill(count, .direction = "down")
Output
# A tibble: 16 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax          3
 3  2020 Pfizer           0
 4  2021 Moderna         88
 5  2021 Novavax         51
 6  2021 Pfizer          63
 7  2022 Moderna         63
 8  2022 Novavax         63
 9  2022 Pfizer          63
10  2023 Moderna         19
11  2023 Novavax          5
12  2023 Pfizer          38
13  2024 Moderna          9
14  2024 Novavax          9
15  2024 Pfizer           9
16  2024 <NA>             7

Similarly, we can fill upwards as follows:

covid_vac_comp |> fill(count, .direction = "up")
Output
# A tibble: 16 × 3
    year vaccine_type count
   <dbl> <chr>        <dbl>
 1  2020 Moderna          3
 2  2020 Novavax          0
 3  2020 Pfizer           0
 4  2021 Moderna         88
 5  2021 Novavax         51
 6  2021 Pfizer          63
 7  2022 Moderna         19
 8  2022 Novavax         19
 9  2022 Pfizer          19
10  2023 Moderna         19
11  2023 Novavax          5
12  2023 Pfizer          38
13  2024 Moderna          9
14  2024 Novavax          7
15  2024 Pfizer           7
16  2024 <NA>             7

Once the data is structured and organized according to tidy principles, we can begin manipulating and transforming it. The next section illustrates how this can be accomplished using the dplyr package from the tidyverse package suit.